29 research outputs found

    Machine learning applied to electoral night forecasting

    This project explores electoral data from the Spanish national elections held between 2000 and 2011, and uses them to develop and test an algorithm that predicts election results with a small error. The prediction studied is an Electoral Night Forecasting (ENF) procedure, carried out during election night: it starts with the counting of the first votes and ends with the publication of the final results. Unlike other widely used methods, the study is not based on sociodemographic data. Machine learning techniques such as clustering and PCA are combined with electoral data from past elections and with the incoming votes of the current election. The prediction has two distinct phases: clustering of historical data, which trains the system, and the prediction itself. In the clustering phase, polling districts are grouped by the similarity of their voting behaviour (the clustering depends strongly on the function chosen to measure this similarity). Predictions of the final results are issued as voting data arrive. They assume that polling districts not yet counted will have results similar to those already counted in the same cluster. This assumption improves the predictions and reduces the error caused by the bias of the earliest results received. The prediction model chosen for this approach was tested with good results in the 2004 South African elections. Before the prediction phase, the data are studied in order to find the optimal clustering type and number of clusters. The results of this study are not indicative enough to choose a good configuration for the model, so the best parameters are sought by computing the predictions themselves and comparing their errors. For every configuration tested, the prediction improves on the results given by a pure count of the incoming votes. Forecasts are analysed as a function of the percentage of votes counted. 
The study shows that fewer clusters work better when a small share of the votes has been counted, whereas a larger number of clusters is more accurate once a large share has arrived. An alternative implementation of the algorithm using PCA (Principal Component Analysis) is also proposed, with the aim of reducing the volume of data involved in computing the clusters, and therefore the computational cost and execution time, and of checking whether this affects the final result and the optimal parameters. Processing the training data with PCA notably improves most of the forecasting results. With PCA, optimal results (or near-optimal ones in the worst case) are obtained with a large number of clusters, regardless of the percentage of votes counted.
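
The extrapolation step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: cluster assignments are taken as given, each district's expected number of valid votes is assumed to be known in advance, and the names `forecast_shares`, `cluster_of`, `electorate` and `counted` are hypothetical. Districts not yet counted are assigned the vote shares observed so far in their own cluster.

```python
from collections import defaultdict


def forecast_shares(cluster_of, electorate, counted):
    """Estimate final vote shares from partially counted districts.

    cluster_of: dict district -> cluster id (assumed to come from historical clustering)
    electorate: dict district -> expected number of valid votes (assumed known)
    counted:    dict district -> dict party -> votes, for districts already counted
    """
    if not counted:
        raise ValueError("at least one district must be counted")

    # Pool the votes of the counted districts within each cluster.
    cluster_votes = defaultdict(lambda: defaultdict(float))
    for district, votes in counted.items():
        for party, n in votes.items():
            cluster_votes[cluster_of[district]][party] += n

    # Fallback pool (all counted votes) for clusters with no counted district yet.
    overall = defaultdict(float)
    for votes in cluster_votes.values():
        for party, n in votes.items():
            overall[party] += n

    totals = defaultdict(float)
    for district, size in electorate.items():
        if district in counted:
            # Counted districts contribute their actual votes.
            for party, n in counted[district].items():
                totals[party] += n
        else:
            # Uncounted districts are assumed to vote like their cluster so far.
            pool = cluster_votes.get(cluster_of[district]) or overall
            pool_sum = sum(pool.values())
            for party, n in pool.items():
                totals[party] += size * n / pool_sum

    grand_total = sum(totals.values())
    return {party: n / grand_total for party, n in totals.items()}
```

With two clusters where only one district per cluster has reported, the forecast weights each cluster's observed shares by the size of its uncounted districts, which is what lets it differ from a pure count of incoming votes.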

    Profiles of Frailty among Older People Users of a Home-Based Primary Care Service in an Urban Area of Barcelona (Spain): An Observational Study and Cluster Analysis

    Background: The multidimensional assessment of frailty allows it to be stratified into degrees; however, the characteristics of people within each stratum remain heterogeneous. The aim of this study was to identify frailty profiles among older people who use a home-based primary care service. Methods: We carried out an observational study from January 2018 to January 2021. Participants were all people cared for by a home-based primary care service. We performed a cluster analysis using k-means clustering. Cluster labeling was determined with the 22 variables of the Frail-VIG index, age, and sex. We computed multiple indexes to assess the optimal number of clusters, and the final number was selected based on a clinical assessment of the best options. Results: Four hundred and twelve participants were clustered into six profiles. Three of these profiles corresponded to a moderate degree of frailty, two to a severe degree and one to a mild degree. In addition, almost 75% of the participants were clustered into three profiles that corresponded to mild and moderate degrees of frailty. Conclusions: Different profiles were found within the same degree of frailty. Knowledge of these profiles can be useful in developing strategies tailored to these differentiated care needs.

    Polypharmacy Patterns in Multimorbid Older People with Cardiovascular Disease: Longitudinal Study

    (1) Introduction: Cardiovascular disease is associated with high mortality, especially in older people. This study aimed to characterize the evolution of combined multimorbidity and polypharmacy patterns in older people with different cardiovascular disease profiles. (2) Material and methods: This longitudinal study drew data from the Information System for Research in Primary Care in people aged 65 to 99 years with profiles of cardiovascular multimorbidity. Combined patterns of multimorbidity and polypharmacy were analysed using fuzzy c-means clustering techniques and hidden Markov models. The prevalence, observed/expected ratio, and exclusivity of chronic diseases and/or groups of these with the corresponding medication were described. (3) Results: The study included 114,516 people, mostly men (59.6%) with a mean age of 78.8 years and a high prevalence of polypharmacy (83.5%). The following patterns were identified: Mental, behavioural, digestive and cerebrovascular; Neuropathy, autoimmune and musculoskeletal; Musculoskeletal, mental, behavioural, genitourinary, digestive and dermatological; Non-specific; Multisystemic; Respiratory, cardiovascular, behavioural and genitourinary; Diabetes and ischemic cardiopathy; and Cardiac. The prevalence of overrepresented health problems and drugs remained stable over the years, although by study end, cohort survivors had more polypharmacy and multimorbidity. Most people followed the same pattern over time; the most frequent transitions were from Non-specific to Mental, behavioural, digestive and cerebrovascular and from Musculoskeletal, mental, behavioural, genitourinary, digestive and dermatological to Non-specific. (4) Conclusions: Eight combined multimorbidity and polypharmacy patterns, differentiated by sex, remained stable over follow-up. Understanding the behaviour of different diseases and drugs can help design individualised interventions in populations with clinical complexity.

    Impact of the COVID-19 pandemic on diagnoses of common mental health disorders in adults in Catalonia, Spain: a population-based cohort study

    This population-based retrospective cohort study investigated how trends in the incidence of anxiety and depressive disorders were affected by the COVID-19 pandemic. It used the Information System for Research in Primary Care (SIDIAP) database in Catalonia, Spain, from 2018 to 2021, and included 3 640 204 individuals aged 18 or older who were in SIDIAP on 1 March 2018 and had no history of anxiety and depressive disorders. The incidence of anxiety and depressive disorders was calculated for the prelockdown period (March 2018-February 2020), the lockdown period (March-June 2020) and the postlockdown period (July 2020-March 2021). Forecasted rates over the COVID-19 periods were estimated using negative binomial regression models based on prelockdown data. The percentage of reduction was estimated by comparing forecasted versus observed events, overall and by sex, age and socioeconomic status. The incidence rates per 100 000 person-months of anxiety and depressive disorders were 151.1 (95% CI 150.3 to 152.0) and 32.3 (31.9 to 32.6), respectively, during the prelockdown period. We observed an increase of 37.1% (95% prediction interval 25.5 to 50.2) in incident anxiety diagnoses compared with the expected number in March 2020, followed by a reduction of 15.8% (7.3 to 23.5) during the postlockdown period. A reduction in incident depressive disorders occurred during the lockdown and postlockdown periods (45.6% (39.2 to 51.0) and 22.0% (12.6 to 30.1), respectively). Reductions were higher among women during the lockdown period, adults aged 18-34 years and individuals living in the most deprived areas. The COVID-19 pandemic in Catalonia was associated with an initial increase in anxiety disorders diagnosed in primary care but a reduction in cases as the pandemic continued. Diagnoses of depressive disorders were lower than expected throughout the pandemic.
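
The two summary quantities reported in this abstract, an incidence rate per 100 000 person-months and a percentage reduction of observed versus forecasted events, follow simple formulas. The sketch below only illustrates that arithmetic; the function names are hypothetical, the inputs in the usage note are made up, and the negative binomial forecasting step itself is not reproduced here.

```python
def incidence_rate(events, person_months, per=100_000):
    """Events per `per` person-months of follow-up."""
    if person_months <= 0:
        raise ValueError("person_months must be positive")
    return events / person_months * per


def percent_reduction(forecasted, observed):
    """Percentage by which observed events fall below the forecast.

    A negative value means more events were observed than forecasted,
    that is, an increase rather than a reduction.
    """
    if forecasted <= 0:
        raise ValueError("forecasted must be positive")
    return (forecasted - observed) / forecasted * 100.0
```

For example, with invented counts, `incidence_rate(30, 20_000)` gives 150.0 per 100 000 person-months, and `percent_reduction(200, 150)` gives 25.0.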

    Soft clustering using real-world data for the identification of multimorbidity patterns in an elderly population: Cross-sectional study in a Mediterranean population

    The aim of this study was to identify, with soft clustering methods, multimorbidity patterns in the electronic health records of a population aged ≥65 years, and to analyse such patterns in accordance with the different prevalence cut-off points applied. Fuzzy cluster analysis allows individuals to be linked simultaneously to multiple clusters and is more consistent with clinical experience than other approaches frequently found in the literature.

    A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data

    Background and objective: As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). Methods: We show step-by-step how to implement the analytics pipeline for the question: ‘In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?’. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. 
L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modelling can enable the rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world.

    International cohort study indicates no association between alpha-1 blockers and susceptibility to COVID-19 in benign prostatic hyperplasia patients

    Purpose: Alpha-1 blockers, often used to treat benign prostatic hyperplasia (BPH), have been hypothesized to prevent COVID-19 complications by minimising cytokine storm release. However, the proposed treatment based on this hypothesis currently lacks support from reliable real-world evidence. We leverage an international network of large-scale healthcare databases to generate comprehensive evidence in a transparent and reproducible manner. Methods: In this international cohort study, we used electronic health records from Spain (SIDIAP) and the United States (Department of Veterans Affairs, Columbia University Irving Medical Center, IQVIA OpenClaims, Optum DOD, Optum EHR). We assessed the association between alpha-1 blocker use and the risks of three COVID-19 outcomes (diagnosis, hospitalization, and hospitalization requiring intensive services) using a prevalent-user active-comparator design. We estimated hazard ratios using state-of-the-art techniques to minimize potential confounding, including large-scale propensity score matching/stratification and negative control calibration. We pooled database-specific estimates through random effects meta-analysis. Results: Our study included 2.6 million users of alpha-1 blockers and 0.46 million users of alternative BPH medications. We observed no significant difference in their risks for any of the COVID-19 outcomes, with our meta-analytic HR estimates being 1.02 (95% CI: 0.92-1.13) for diagnosis, 1.00 (95% CI: 0.89-1.13) for hospitalization, and 1.15 (95% CI: 0.71-1.88) for hospitalization requiring intensive services. Conclusion: We found no evidence of the hypothesized reduction in risks of the COVID-19 outcomes from prevalent use of alpha-1 blockers. Further research is needed to identify effective therapies for this novel disease.
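
Pooling database-specific hazard ratios through a random effects meta-analysis can be sketched as below. The abstract does not say which estimator was used, so this sketch assumes the common DerSimonian-Laird method; the function name `pool_random_effects` and its inputs (hazard ratios with standard errors on the log scale) are hypothetical, and the 1.96 multiplier assumes a normal approximation for the 95% interval.

```python
import math


def pool_random_effects(hazard_ratios, log_standard_errors):
    """Pool hazard ratios on the log scale (DerSimonian-Laird, assumed).

    Returns (pooled HR, lower 95% bound, upper 95% bound).
    """
    k = len(hazard_ratios)
    if k < 2 or k != len(log_standard_errors):
        raise ValueError("need at least two estimates with matching standard errors")

    y = [math.log(hr) for hr in hazard_ratios]
    v = [se ** 2 for se in log_standard_errors]

    # Fixed-effect (inverse-variance) estimate, needed for the heterogeneity statistic Q.
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))

    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))

    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))
```

When the database estimates disagree strongly, `tau2` grows and the pooled interval widens, which is the main practical difference from a fixed-effect pooling.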

    Renin-angiotensin system blockers and susceptibility to COVID-19: an international, open science, cohort analysis

    Background: Angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin receptor blockers (ARBs) have been postulated to affect susceptibility to COVID-19. Observational studies so far have lacked rigorous ascertainment adjustment and international generalisability. We aimed to determine whether use of ACEIs or ARBs is associated with an increased susceptibility to COVID-19 in patients with hypertension. Methods: In this international, open science, cohort analysis, we used electronic health records from Spain (Information Systems for Research in Primary Care [SIDIAP]) and the USA (Columbia University Irving Medical Center data warehouse [CUIMC] and Department of Veterans Affairs Observational Medical Outcomes Partnership [VA-OMOP]) to identify patients aged 18 years or older with at least one prescription for ACEIs and ARBs (target cohort) or calcium channel blockers (CCBs) and thiazide or thiazide-like diuretics (THZs; comparator cohort) between Nov 1, 2019, and Jan 31, 2020. Users were defined separately as receiving either monotherapy with these four drug classes, or monotherapy or combination therapy (combination use) with other antihypertensive medications. We assessed four outcomes: COVID-19 diagnosis; hospital admission with COVID-19; hospital admission with pneumonia; and hospital admission with pneumonia, acute respiratory distress syndrome, acute kidney injury, or sepsis. We built large-scale propensity score methods derived through a data-driven approach and negative control experiments across ten pairwise comparisons, with results meta-analysed to generate 1280 study effects. For each study effect, we did negative control outcome experiments using a possible 123 controls identified through a data-rich algorithm. This process used a set of predefined baseline patient characteristics to provide the most accurate prediction of treatment and balance among patient cohorts across characteristics. 
The study is registered with the EU Post-Authorisation Studies register, EUPAS35296. Findings: Among 1 355 349 antihypertensive users (363 785 ACEI or ARB monotherapy users, 248 915 CCB or THZ monotherapy users, 711 799 ACEI or ARB combination users, and 473 076 CCB or THZ combination users) included in analyses, no association was observed between COVID-19 diagnosis and exposure to ACEI or ARB monotherapy versus CCB or THZ monotherapy (calibrated hazard ratio [HR] 0·98, 95% CI 0·84-1·14) or combination use exposure (1·01, 0·90-1·15). ACEIs alone similarly showed no relative risk difference when compared with CCB or THZ monotherapy (HR 0·91, 95% CI 0·68-1·21; with heterogeneity of >40%) or combination use (0·95, 0·83-1·07). Directly comparing ACEIs with ARBs demonstrated a moderately lower risk with ACEIs, which was significant with combination use (HR 0·88, 95% CI 0·79-0·99) and non-significant for monotherapy (0·85, 0·69-1·05). We observed no significant difference between drug classes for risk of hospital admission with COVID-19, hospital admission with pneumonia, or hospital admission with pneumonia, acute respiratory distress syndrome, acute kidney injury, or sepsis across all comparisons. Interpretation: No clinically significant increased risk of COVID-19 diagnosis or hospital admission-related outcomes associated with ACEI or ARB use was observed, suggesting users should not discontinue or change their treatment to decrease their risk of COVID-19.

    Implementation of the COVID-19 vulnerability index across an international network of health care data sets: Collaborative external validation study

    Background: SARS-CoV-2 is straining health care systems globally. The burden on hospitals during the pandemic could be reduced by implementing prediction models that can discriminate patients who require hospitalization from those who do not. The COVID-19 vulnerability (C-19) index, a model that predicts which patients will be admitted to hospital for treatment of pneumonia or pneumonia proxies, has been developed and proposed as a valuable tool for decision-making during the pandemic. However, the model is at high risk of bias according to the "prediction model risk of bias assessment" criteria, and it has not been externally validated. Objective: The aim of this study was to externally validate the C-19 index across a range of health care settings to determine how well it broadly predicts hospitalization due to pneumonia in COVID-19 cases. Methods: We followed the Observational Health Data Sciences and Informatics (OHDSI) framework for external validation to assess the reliability of the C-19 index. We evaluated the model on two different target populations, 41,381 patients who presented with SARS-CoV-2 at an outpatient or emergency department visit and 9,429,285 patients who presented with influenza or related symptoms during an outpatient or emergency department visit, to predict their risk of hospitalization with pneumonia during the following 0-30 days. In total, we validated the model across a network of 14 databases spanning the United States, Europe, Australia, and Asia. Results: The internal validation performance of the C-19 index had a C statistic of 0.73, and the calibration was not reported by the authors. When we externally validated it by transporting it to SARS-CoV-2 data, the model obtained C statistics of 0.36, 0.53 (0.473-0.584) and 0.56 (0.488-0.636) on Spanish, US, and South Korean data sets, respectively. The calibration was poor, with the model underestimating risk. 
When validated on 12 data sets containing influenza patients across the OHDSI network, the C statistics ranged between 0.40 and 0.68. Conclusions: Our results show that the discriminative performance of the C-19 index model is low for influenza cohorts and even worse among patients with COVID-19 in the United States, Spain, and South Korea. These results suggest that C-19 should not be used to aid decision-making during the COVID-19 pandemic. Our findings highlight the importance of performing external validation across a range of settings, especially when a prediction model is being extrapolated to a different population. In the field of prediction, extensive validation is required to create appropriate trust in a model.
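
The C statistic reported throughout this abstract measures discrimination: the probability that a randomly chosen patient who had the outcome received a higher predicted risk than a randomly chosen patient who did not. The sketch below is a generic pairwise computation for a binary outcome, not the OHDSI tooling; the function name `c_statistic` is hypothetical and ties are counted as half-concordant.

```python
def c_statistic(predicted_risks, outcomes):
    """Concordance (C statistic) for a binary outcome.

    predicted_risks: list of model scores, higher meaning higher predicted risk
    outcomes:        list of 0/1 labels, 1 meaning the event occurred
    """
    with_event = [r for r, y in zip(predicted_risks, outcomes) if y == 1]
    without_event = [r for r, y in zip(predicted_risks, outcomes) if y == 0]
    if not with_event or not without_event:
        raise ValueError("need at least one patient with and one without the event")

    concordant = 0.0
    for risk_event in with_event:
        for risk_no_event in without_event:
            if risk_event > risk_no_event:
                concordant += 1.0   # correctly ranked pair
            elif risk_event == risk_no_event:
                concordant += 0.5   # tie counts as half
    return concordant / (len(with_event) * len(without_event))
```

A value of 0.5 corresponds to ranking patients no better than chance, and a value below 0.5 means the model tends to rank patients in the wrong order.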